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Abstract - An analytical formula for the probability distribution of stock-market returns, derived from the Heston model 
assuming a mean- reverting stochastic volatility, was recently proposed by Dragulescu and Yakovenko in Quantitative 
Finance 2002. While replicating their results, we found two significant weaknesses in their method to pre-process the 
data, which cast a shadow over the effective goodness-of-fit of the model. We propose a new method, more truly capturing 
the market, and perform a Kolmogorov-Smirnov test and a x 2 test on the resulting probability distribution. The results 
raise some significant questions for large time lags — 40 to 250 days — where the smoothness of the data does not require 
such a complex model; nevertheless, we also provide some statistical evidence in favour of the Heston model for small 
time lags — 1 and 5 days — compared with the traditional Gaussian model assuming constant volatility. 
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1 Assessed model 

Standard models of stock-market fluctuations predict a normal (Gaussian) distribution for stock price log-returns [1]. 
However, the empirical distribution exhibits significant kurtosis with a greater probability mass in its tails and centre of the 
distribution [2]. In Quantitative Finance 2002 ([3]), Dragulescu and Yakovenko (DY) proposed an analytical formula for 
the probability density function (pdf ) of stock price log-returns, based on the Heston model [4] — a geometric Brownian 
motion for the log-returns time series coupled with a stochastic mean-reverting volatility. The resulting pdf is claimed to 
outperform the Gaussian on a large scale of time lags (t = 1,5, 20, 40 and 250 days). 

1.1 Heston model and DY formula 

This model starts with the usual assumption that the price St follows a geometric Brownian motion described by the 
following stochastic differential equation 

dS t = pS t dt + o-tStdW^ (1) 

where p is the trend of the market, 
at is the volatility, 
W t is a standard Wiener process. 
Log-returns r t = Log^ and centred log -returns x t — r t — /it are then introduced: 

dr t = (/x - -^)dt + y/vidW^ since a t = (2) 

and 

dx t = -jdt + ^T t dW t {1} (3) 

Then, instead of having a constant volatility er t = a as in the Bachelier-Osborne model [5, 6], the Heston model assumes 
the variance v t — of obeys the following mean-reverting stochastic differential equation 

dv t = - 7 (v t ~ 6)dt + kyfthdWP (4) 



where 9 is the long time mean of v t , 

7 is the relaxation rate of this mean, 

A: is a constant parameter called the variance noise, 

dW^ is another standard Wiener process, not necessarily correlated with dW^ . 
DY solve the forward Kolmogorov equation that governs the time evolution of the joint probability P t (x,v\vi) of having 
the log-return x and the variance v for the time lag t, given the initial value Vi of the variance 

They introduce a Fourier transform to solve analytically this equation, and obtain the following expression for the proba- 
bility distribution of centred log-returns x, given a time lag t: 

Pt{x) = — / dp x e ip * +F ^ (6) 



with 
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where T = 7 + ipkp x , 

p is the correlation coefficient between the two Wieners and W^ 2 \ 

n = y/r* + k*(p>-i Px ), 

7 , 0, k and p are the parameters of the Heston model. 
Eqn. (6), hereafter designated as the "DY formula", is the central result of DY's paper [3]. It gives the expected proba- 
bility distribution of centred log-returns x, given a time lag t. An asymptotic analysis 1 of Pt(x) shows that it predicts a 
Gaussian distribution for small values of \x\, and exponential, time dependent tails for large values of |a:|. 

To confront their formula with observed log-returns, DY take the Dow-Jones Industrial Average from January 04, 1982 
to December 31, 2001, and train the four parameters of the Heston model, 7, 9, k and /i, to fit the empirical distribution 
by minimising the following square-mean deviation error 

E = J2\logP;^)-logPt(x)\ 

for all available values of log-returns x, and time lags t = 1,5, 20, 40 and 250 days, where P t *(x) is the empirical prob- 
ability mass and P t (x) the one predicted by the DY formula. In their results, they set the correlation coefficient p to 
zero, since (i) their trained parameter p tramed is almost null (p tramed ~ 0), and (ii) they do not observe any significant 
difference, when fitting empirical data, between taking p tramed ~ or p — 0. Hence, they reduce the complexity of their 
formula. Minimising the deviation of the log instead of the absolute difference \P£(x) — Pt{x) \ forces the parameters to 
fit the fat tails instead of the middle of the distribution, where the probability mass is very high. 



A replication of their results, using the same dataset and the same method, is shown in Figure 1 . The pdfs for different 
time lags from 1 to 250 are shown each shifted upwards by a factor of 10 for clarity. Theoretically, this result is brilliant, 
since the authors obtain an analytic expression for the pdf Pt(x). Furthermore, their model (plain line) seems to fit the 
empirical data (dots) far better than the Gaussian (dash line), especially if we look at the fat tails. 

But we argue that the method they use to evaluate the goodness-of-fit of their model suffers from two major drawbacks 
when pre-processing the data. 

'See [3] Part VI 
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Figure 1 : Replication of the main result of DY, where outliers are trimmed and data are reused. 



1.2 Pre-processing the data 

Firstly, DY trimmed the log-returns time series, rejecting any value out of the boundaries presented in Table l. 2 By 
doing so, they remove most of the leptokurtosis of the original dataset (the positive peakiness of the pdf in the centre), 
which is precisely one of the discrepancies from the Gaussian that they should try to fit. 

time lag trimming boundaries 



1 


[-0.04 


04 


5 


[-0.08 


08 


20 


[-0.13 


15 


40 


[-0.17 


20 


80 


[-0.18 


25 


100 


[-0.20 


28 


200 


[-0.22 


38 


250 


-0.22 


44 



Table 1: Boundaries used by DY to trim the empirical log-returns time series. 

We visualise in Figure 2 the effect of trimming the data: all of the log-returns outside the boundaries, represented 
here by the two horizontal lines, were trimmed. We believe this way of trimming the data is unfair, because it removes 
information from the dataset. Given that the model is supposed to outperform the Bachelier-Osborne model, and specially 
to fit the kurtosis and the fat tails, removing extreme values (that belong to the fat tails, and produce kurtosis) is counter- 
productive. Even the normal distribution could fit the data quite well in these conditions [7]. To prove this, we compare 

2 This step is not mentioned in DY's paper. Before applying this trimming method, strange points used to appear in our results. Then we contacted 
the authors who informed us they had trimmed the log-returns time series using the boundaries specified in Table 1 . 
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DJIA1982, time lag=1 
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Figure 2: Boundaries used by DY to trim the log-returns, for daily returns. 



the kurtosis of trimmed and untrimmed empirical data in Table 2. 



time lag 


^untrimmed 


ktrimme 


1 


69.27 


1.40 


5 


19.68 


0.72 


20 


7.80 


0.43 


40 


6.02 


0.56 


250 


-0.33 


-0.53 



Table 2: Comparison of the kurtosis of trimmed and untrimmed empirical data. 



Obviously, the kurtosis disappears when data are trimmed, which makes the dataset considerably smoother. Back in 
1965, Fama [8] had already made similar criticism of Kendall's experiments [9], in which the latter considered outliers so 
extreme that he just dropped them. 

A second major concern is that DY use a single log-return time series of overlapping returns. For a given index / at a 
given period, let us say the Dow Jones Industrial Average from January 04, 1982, to December 31, 2001, and a given time 
lag r, let us say r = 5 days, the raw close price dataset closePrice is composed of n close prices, here n — 5050. When 
DY compute the log-returns dataset logReturns starting from closePrice, they obtain the following time series: 



logReturns — {rt\t € [1, n — r]} 



where r t — log Pt pJ , Vt 6 [1, r, 
In our example, we would have 



logReturns 



{ri,r 2 , ...,r„_ r } 

r; Pi+t , P%+t 
\ lo 9—^—Jog- 



Pi 



Pi 



i P" i 
<log- — } 

n — t 
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r, A , P7 , -P5O5O-, 

= \log—,log—,...,log— } 



Thus, they obtain a single dataset of n — r log-returns. We believe this way of computing the log-returns time series is 
biased, because it "re-uses" the data. Indeed, let us assume that a crash occurs at time t*. Then they will take into account 
this specific event r times exactly in their dataset in log-returns {r t »_ T , r t *_ T+ i, r t *_i}. Working on this time series 
will definitely fatten the tails, since every shock in the original price time series will appear exactly r times in the log- 
returns. Hence, the resulting log-returns time series is not an accurate representation of the market moves anymore; this 
gives a decisive advantage to their model compared with the Gaussian, since they train their parameters to fit specifically 
the fat tails. 

Finally, DY do not statistically test the goodness-of-fit of their formula, but rather rely exclusively on plots that truly 
look good. 

2 Goodness-of-fit tests 

For these reasons, we decided to keep the outliers in our dataset and to use multiple non-overlapping log-return time 
series. We perform our tests on the Dow-Jones Industrial Average, from January 04, 1982 to December 31, 2001. From 
the raw price time series, we derive r log-return time series logReturnsj (j e [1,t]) of cardinality m = [-], called 
"paths", instead of a single log-return time series logReturn of cardinality n — r. This method sounds more reasonable 
for very different reasons: firstly, when economists or traders talk about weekly returns (t — 5 days, holidays excluded), 
they mean returns from Mondays to Mondays (for instance), and not from Monday to Monday, Tuesday to Tuesday, etc. 
Secondly, this method gives a better view of the real market, since tails are not artificially fattened. And finally, this 
method will enable us to assess the statistical accuracy of the estimators we will compute. 

We can now perform our goodness-of-fit tests on our different models: the Gaussian (normPDF), the curve resulting from 
the DY formula (draguPDF), and a Neural Network 3 intentionally overfitting the data, that will be used as a benchmark 
(nnPDF). 

2.1 Kolmogorov-Smirnov Statistic 

DY claim that their model fits the empirical data of the concerned dataset better than the Gaussian for any time lag. To 
check this, we use the Kolmogorov-Smirnov Statistic, based on the maximal discrepancy between the expected and the 
observed cumulative distributions, for any log-return x. This statistic is suitable for testing only a simple hypothesis, for 
instance a Gaussian with known parameters [i and a, but not a composite hypothesis (a class of Gaussians, or a Gaussian 
with fi and a derivated from the sample dataset being tested). Unfortunately, whatever the model, we always derive the 
parameters (p and a for normPDF, 7, 9, k and [i for draguPDF, the weights and biases for nnPDF) from the initial 
dataset. By performing this test with parameters derived from the dataset being tested, we expect the statistic to be large 
enough to reject the simple hypothesis, and a fortiori the composite hypothesis ([10]). But if the value of the statistic Z is 
small enough to accept the simple hypothesis, it does not mean that we can accept the composite hypothesis. 

Methodology - For each time lag, we compute the log-returns dataset, and we divide it into paths. For each path 
and each model, we build the empirical cumulative density function empCDF and the expected CDF modelCDF {norm- 
CDF, draguCDF or nnCDF), and we compute the KS-statistic Z (see Figure 3 for an example). We present in Tables 
3, 4 and 5 the mean Z and standard deviation oz of Z over the different paths, and the associated p-value 4 interval 

p(Z + a z )<p(Z)<p(Z-a z ). 

Results - Firstly, we observe an important variance, over the different paths, in the statistic Z: the standard deviation 
az is not negligible in comparison with the mean Z. This is an evidence that the paths are not equivalent, which legit- 
imates, a posteriori, our sampling method. It comes from the high heterogeneity of the dataset, which makes our tests 

3 Even if draguPDF is supposed to fit the empirical distribution, empPDF, better than normPDF, we want to compare it with the best fit possible, the 
one obtained with a Neural Network. This Neural Network must be as simple as possible, but should fit the main characteristics of the empirical time 
series, fat tail and kurtosis. The structure chosen was the following: it is a feed-forward back-propagation network, with a five node hidden layer and a 
single node output layer. The transfer functions are respectively tansig and purelin, where tansig(n) = 1+e _^ t „ — 1 and purelin(n) = n. This 
structure appears to be a good trade-off between the complexity and the goodness of fit. 

4 The p-value is the probability of observing the given sample result under the assumption that the null hypothesis (the tested model) is true. If the 
p-value is less than the level of significance a, then you reject the null hypothesis. For example, if a = 0.05 and the p-value is 0.03, then you reject the 
null hypothesis. 
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DJIA1982, time lag = 5 days 
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Figure 3: Cumulative Density Functions for the different models, r = 5 days. 

less robust. But any test performed on this heterogeneous dataset would suffer from the same problem. Even though this 
apparent lack of consistency prevents us from drawing any strong and global conclusion, the knowledge of the mean and 
standard deviation Z ± az provides us with a fair overview of the statistic Z. 

On plots, the DY formula seems to fit the empirical cumulative distribution better than the Gaussian. But in fact, on 
average, both models are rejected for high frequencies (for r = 1 and 5 days) at the 0.01 level of significance. Even the 
Neural Network is rejected for a one day time lag. This rejection of the three models may come from the fact that this test 
is based on the maximum discrepancy between the empirical and the theoretical cumulative distributions, for any x. To 
pass this test, a model must fit the observed data sufficiently well everywhere, i.e. in the tails (problem of fat tails) and in 
the middle (problem of high kurtosis for high frequencies) of the distribution. 

We point out that even if the DY formula is rejected for a one day time lag, the statistic Z is smaller than the Gaussian one 
(0.109 vs 0.131), which is an indication that the model fits the data a bit better. For other time lags, the p-value are equiv- 
alent: both models are systematically rejected for 5 days (p <C 0.01), sometimes rejected for 20 days (p(Z + az) < 0.01, 
but p(Z — az) > 0.05) and never rejected for higher frequencies (p ^> 0.05). For medium and low frequencies, the fact 
that the simple hypothesis is not rejected does not guarantee that the composite hypothesis can be accepted. 

Conclusion - The Kolmogorov-Smirnov goodness-of-fit test rejects both the Gaussian and the DY formula for high 
frequencies (t = 1 and 5 days). For medium and low frequencies, we cannot come to a firm conclusion because of the 
theoretical limits of this test. To continue with the investigation, we need a more powerful statistical test that can be used 
even when the parameters of the model are derived from the dataset being tested. The \ 2 statistic is suitable in those 
conditions. 

2.2 x 2 Statistic 

The x 2 goodness-of-fit test, based on binned data, is a powerful statistical tool to test if an empirical sample comes 
from a given distribution. Contrary to the Kolmogorov-Smirnov test, it is designed to evaluate a composite hypothesis, i.e. 
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time lag 


Z±G Z 


p-values 


1 


0.131 






2.93e-75 






5 


0.081 ±0.020 


1.75e-09 


< 


2.98e-06 


< 


9.88e-04 


20 


0.089 ±0.013 


9.51e-03 


< 


0.036 


< 


0.112 


40 


0.104 ±0.020 


0.038 


< 


0.122 


< 


0.321 


80 


0.113 ±0.021 


0.199 


< 


0.385 


< 


0.649 


100 


0.113 ±0.021 


0.322 


< 


0.533 


< 


0.778 


200 


0.148 ±0.038 


0.339 


< 


0.630 


< 


0.917 


250 


0.170 ±0.047 


0.291 


< 


0.598 


< 


0.919 



Table 3: KS-Test on the Gaussian. 



time lag 


Z ±a z 


p-values 


1 


0.109 






1.2e-53 






5 


0.087 ±0.019 


2.08e-10 


< 


3.64e-07 


< 


1.48e-04 


20 


0.089 ±0.014 


8.75e-03 


< 


0.033 


< 


0.104 


40 


0.094 ±0.010 


0.125 


< 


0.211 


< 


0.337 


80 


0.116 ±0.018 


0.197 


< 


0.355 


< 


0.576 


100 


0.128 ±0.019 


0.215 


< 


0.372 


< 


0.585 


200 


0.163 ±0.048 


0.209 


< 


0.512 


< 


0.893 


250 


0.186 ±0.046 


0.224 


< 


0.481 


< 


0.816 



Table 4: KS-Test on the DY formula. 



time lag 


Z ±a z 


p-values 


1 


0.106 






3.27e-42 






5 


0.048 ±0.014 


1.64e-03 


< 


0.026 


< 


0.204 


20 


0.047 ±0.009 


0.430 


< 


0.615 


< 


0.778 


40 


0.071 ±0.059 


0.153 


< 


0.746 


< 


1 


80 


0.075 ±0.061 


0.729 


< 


0.919 


< 


0.995 


100 


0.076 ±0.039 


0.532 


< 


0.871 


< 


0.999 


200 


0.1 16 ±0.034 


0.126 


< 


0.453 


< 


0.932 


250 


0.137 ±0.041 


0.190 


< 


0.502 


< 


0.904 



Table 5: KS-Test on the Neural Network. 
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the parameters of the model can be derivated from the empirical dataset tested. This test is a good trade-off between the 
goodness-of-fit of a model (the better fit, the smaller the x 2 statistic) and its complexity (the more complex, the smaller 
p-value). Indeed, even if a model fits the empirical data very well, a too large complexity may penalise its p-value, so 
that it can still be rejected. Finally, to be meaningful, this test must be performed using relatively large bins, and a critical 
value of 5 expected observations per bin is regarded as a minimum. 

Methodology - If we perform this test with equal size bins, then the fats tails will be trimmed (there are less than 
5 expected log-returns per bin in the tails) and will not participate in the value of the statistic, making the test irelevant. 
Instead, we split the log-return axis into equal expected frequency bins, so that all of the log-returns participate in the 
value of the statistic. We use an expected frequency of 5 log-returns per bin. Unfortunately, this test cannot be performed 
for large time lags, because of the lack of data. Indeed, for the Dow-Jones index from 1982 to 2001 for instance, we 
have initially around 5000 close prices, which means that for a time lag of 250 days, each path will have only about 20 
log-returns. In those conditions, because of the critical value of 5 log-returns per bin, we will have at best 4 bins, which is 
too small to perform a relevant test. 

Results - We present our results of the \ 2 goodness-of-fit test in Tables 6, 7 and 8. The degree of freedom is given by 
df = noBins — 1 — to, where to is the number of parameters of the model (to = 2 for the Gaussian, to = 4 in the DY 
formula, and to = 11 for the Neural Networks if we count the weights and the biases). For large time lags, df becomes 
smaller and smaller because noBins decreases, as explained above. 

Concerning the Neural Network, we cannot perform this test for time lags higher than 40 days, or else the degree of 
freedom decreases to zero. This is due to the relatively high number of parameters (to = 11). With a structure even more 
complicated, we could not have performed the test at all, except for high frequencies. 

First we notice that the Neural Network's x 2 statistic is slightly smaller than the DY's one, itself smaller than the Gaus- 
sian's one, for all paths with a time lag from r = 1 to t = 80 days. It means that the Neural Networks fits empirical 
data better than the DY formula, which itself has a better fit than the Gaussian. But there is a price to pay, in terms of 
complexity: due to too many parameters (and then a lower degree of freedom), the p-value of the Neural Networks and 
DY formula are not systematically higher than the p-value of the Gaussian. And it is precisely the p-value that is used to 
accept or reject a model, not directly the x 2 statistic. 
If we look at the p-value in detail, we observe that 

• For t = 1, the three models are rejected at a 0.05 level of significance 

• For t = 5, only the Neural Network is systematically accepted. The Gaussian and DY formula are only accepted 
in the best situation (p(x 2 ) < 0.05 < p(x 2 — c x 2 )) 

• For t = 20, the three models are accepted and the DY formula is better than the Neural Network and the Gaussian 

• For r = 40 and r = 80, the DY formula is still accepted, but its p-value is smaller than the one of the Gaussian 

Conclusion - Thanks to the x 2 goodness-of-fit test, we can assert that the DY formula fits empirical data slightly 
better than the Gaussian, for high and medium frequencies. Nevertheless, both models are rejected for high frequencies 
(t = 1 and 5 days), at a 0.05 level of significance. In this sense, these results are consistent with the Kolmogorov-Smirnov 
goodness-of-fit test. 

We also observe a clear shift in the goodness-of-fit of the models around r = 40 days: the probability of accepting the 
Gaussian becomes larger than the probability to accept the DY formula (and even the Neural Network) due to the lower 
complexity of the Gaussian (two parameters instead of four and eleven respectively). 

To put it in a nutshell, using a complex model, such as the DY formula or a Neural Network, is only worthwhile for 
r = 1, 5 and 20 days. For lower frequencies (r > 40 days), the Gaussian is preferable because it is simpler. Given that 
for these frequencies, we had observed neither fat tails nor kurtosis in the empirical datasets, the Gaussian represents the 
best trade-off between goodness-of-fit and complexity. 
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time lag 


X 2 ± 


erX 2 


df 


p-values 


1 


1790 




1010 






6.29e-ll 






5 


255 


±30 


198 


5.38e-05 


< 


4.07e-03 


< 


0.0931 


20 


61 


±12 


47 


7.99e-03 


< 


0.0819 


< 


0.409 


40 


29.1 


±7.0 


22 


0.0295 


< 


0.141 


< 


0.451 


80 


10.4 


±4.6 


9 


0.0915 


< 


0.32 


< 


0.76 



Table 6: x 2 Test on the Gaussian. 



time lag 


x 2 ± 


9 

o-x 


df 


p-values 


1 


1420 




1000 






1.16e-04 






5 


244 


±26 


196 


332e-04 


< 


0.0108 


< 


0.133 


20 


48.5 


±11.5 


45 


0.0663 


< 


0.333 


< 


0.796 


40 


27.3 


±6.1 


20 


0.0301 


< 


0.126 


< 


0.385 


80 


9.7 


±4.4 


7 


0.049 


< 


0.206 


< 


0.624 



Table 7: x 2 Test on the DY formula. 



time lag 


X 2 ± 


crx 2 


df 


p-values 


1 


2230 




997 






0.0839 






5 


232 


±38 


189 


0.0559 


< 


0.346 


< 


0.817 


20 


45.9 


±11.1 


38 


0.0473 


< 


0.25 


< 


0.688 


40 


21.5 


±6.3 


13 


0.0057 


< 


0.0552 


< 


0.333 


80 


7.6 


±6.3 





NaN 


< 


NaN 


< 


NaN 



Table 8: x 2 Test on the Neural Network. 
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3 Conclusion 

The DY analytical formula is quite exciting, and would validate the Heston model, if the resulting expected distribution 
was similar to the empirical one. Our method of pre-processing the data is different from the one suggested by Dragulescu 
and Yakovenko, and, we believe, more truly captures the market. We find that the DY formula consistently outperforms 
the Gaussian in terms of best fit (the error between the model and the observed data), but a higher complexity is the price 
to pay: the number of parameters is greater in the Heston model, which enables the fitted pdf to exhibit kurtosis and 
fat tails when the empirical dataset does, but also penalizes it when it doesn't. Hence, in terms of goodness-of-fit (the 
trade-off between best fit and complexity), the Gaussian is preferable for low frequencies (40 < t < 250 days), since the 
empirical dataset is then quite smooth and does not exhibit kurtosis. For medium frequencies (t = 20 days), both models 
are accepted at a .05 confidence level. Finally, for high frequencies, although it performs better than the Gaussian, the DY 
formula is still rejected (t = 1 and 5 days), mainly because of the extremely high kurtosis of the observed data. 
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